Introduction to supervised machine learning

Ben Lambert

Material covered

  • linear and logistic regression
  • how to train a ML model via gradient descent
  • what do over- and under-fitting mean?

Linear regression

Example house price versus size data

Example model: house price and size

\[\begin{equation} H_i = a + b S_i + \epsilon_i \end{equation}\]

where \(H_i\) is the house price; \(S_i\) is the size in square feet; and \(\epsilon_i\) is an error term.

What does this model look like?

What are \(\epsilon_i\)?

Choice of loss function

Choose a model to minimise the mean of the squared errors, that is:

\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} \epsilon_i^2 \end{equation}\]

which is the same as choosing values of \(a\) and \(b\) to minimise:

\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} (H_i - (a + b S_i))^2 \end{equation}\]
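A minimal NumPy sketch of this loss (the array names `S` and `H` are assumed, holding sizes and prices):

```python
import numpy as np

def mse_loss(a, b, S, H):
    """Mean squared error of the line H = a + b*S."""
    residuals = H - (a + b * S)  # these are the epsilon_i
    return np.mean(residuals ** 2)
```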

Least-squares regression line

What about other loss functions?

Least absolute deviation loss:

\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} |\epsilon_i| \end{equation}\]

Quartic power loss:

\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} \epsilon_i^4 \end{equation}\]
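The alternative losses differ only in how residuals are penalised; a sketch under the same assumed setup:

```python
import numpy as np

def lad_loss(a, b, S, H):
    """Least absolute deviation: penalises residuals linearly."""
    return np.mean(np.abs(H - (a + b * S)))

def quartic_loss(a, b, S, H):
    """Quartic loss: large residuals dominate the fit."""
    return np.mean((H - (a + b * S)) ** 4)
```

Least absolute deviation is less sensitive to outliers than least squares; quartic loss is more sensitive, so each choice yields a different fitted line.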

Other regression lines

How do we estimate parameters?

Suppose we want to minimise a least-squares loss function:

\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} (H_i - (a + b S_i))^2 \end{equation}\]

Choose \(a\) and \(b\) to minimise this loss \(\implies\) differentiate!

Learning parameters

Determine \(\hat{a}\) and \(\hat{b}\) as the values minimising \(L\):

\[\begin{align} \frac{\partial L}{\partial a} &= -\frac{2}{K}\sum_{i=1}^{K} (H_i - (a + b S_i)) = 0\\ \frac{\partial L}{\partial b} &= -\frac{2}{K}\sum_{i=1}^{K} S_i (H_i - (a + b S_i)) = 0 \end{align}\]
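Solving these two equations simultaneously gives the familiar closed-form least-squares estimates:

\[\begin{align} \hat{b} &= \frac{\sum_{i=1}^{K}(S_i - \bar{S})(H_i - \bar{H})}{\sum_{i=1}^{K}(S_i - \bar{S})^2}\\ \hat{a} &= \bar{H} - \hat{b}\bar{S} \end{align}\]

where \(\bar{S}\) and \(\bar{H}\) denote the sample means.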

General solution of loss-minimisers

For general loss functions, no closed-form solution exists. That is, usually equations like:

\[\begin{align} -\frac{2}{K}\sum_{i=1}^{K} (H_i - (a + b S_i)) &= 0\\ -\frac{2}{K}\sum_{i=1}^{K} S_i (H_i - (a + b S_i)) &= 0 \end{align}\]

have no analytic solution. (Here, they actually do.)

Gradient descent

Instead of solving the equations directly, we use gradient descent optimisation:

  1. initialise parameters \(a=a_0\), \(b=b_0\)
  2. in each epoch update parameters:

\[\begin{align} a &= a - \eta \frac{\partial L}{\partial a}\\ b &= b - \eta \frac{\partial L}{\partial b} \end{align}\]

until \(a\) and \(b\) no longer change. Here \(\eta > 0\) is the learning rate.
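A minimal sketch of this loop in NumPy, assuming data arrays `S` and `H`, using a fixed epoch budget rather than a convergence check (the learning rate is illustrative and would need tuning on real data):

```python
import numpy as np

def fit_line(S, H, eta=1e-4, epochs=10_000):
    """Gradient descent on the mean-squared-error loss."""
    a, b = 0.0, 0.0              # step 1: initialise parameters
    K = len(S)
    for _ in range(epochs):      # step 2: one update per epoch
        residuals = H - (a + b * S)
        grad_a = -2.0 / K * np.sum(residuals)
        grad_b = -2.0 / K * np.sum(S * residuals)
        a -= eta * grad_a        # a = a - eta * dL/da
        b -= eta * grad_b        # b = b - eta * dL/db
    return a, b
```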

Gradient descent: initial point

Gradient descent: first step

Gradient descent: next step

Gradient descent: convergence

Gradient descent in 2D

Too high learning rate

Too low learning rate

Making the model more complex

\[\begin{equation} H_i = a + b S_i + c S_i^2 + \epsilon_i \end{equation}\]

What does this model look like?

Quadratic regression line

How to estimate model parameters?

Least-squares loss function:

\[\begin{equation} L = \frac{1}{K} \sum_{i=1}^{K} (H_i - (a + b S_i + c S_i^2))^2 \end{equation}\]

Then use gradient descent

\[\begin{align} a &= a - \eta \frac{\partial L}{\partial a}\\ b &= b - \eta \frac{\partial L}{\partial b}\\ c &= c - \eta \frac{\partial L}{\partial c} \end{align}\]
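The training loop extends directly; a sketch of the extra update, under the same assumptions as before:

```python
import numpy as np

def fit_quadratic(S, H, eta=1e-6, epochs=100_000):
    """Gradient descent for H = a + b*S + c*S^2."""
    a, b, c = 0.0, 0.0, 0.0
    K = len(S)
    for _ in range(epochs):
        residuals = H - (a + b * S + c * S ** 2)
        a -= eta * (-2.0 / K) * np.sum(residuals)
        b -= eta * (-2.0 / K) * np.sum(S * residuals)
        c -= eta * (-2.0 / K) * np.sum(S ** 2 * residuals)  # new parameter
    return a, b, c
```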

What about more complex models?

\[\begin{equation} H_i = a + b S_i + c S_i^2 + d S_i^3 + \dots + \epsilon_i \end{equation}\]

More complex model

Which model is best?

Original data

New data

What went wrong?

  • adding more parameters always reduces error on training set
  • but results in a model that generalises poorly
  • all else being equal, simpler models are better

What is a good fitting model?

How to get to a “fit” model?

Hold out a separate validation set to test model predictions on

Using validation set to determine optimal model complexity

Will a trained model perform as well in real life?

No:

  • using the validation set to select a model means we effectively overfit to it
  • thus create a separate test set that is used only once to gauge performance (a splitting sketch follows)
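A minimal sketch of a random three-way split in NumPy (the array names `X` and `y` and the fractions are illustrative assumptions):

```python
import numpy as np

def train_val_test_split(X, y, val_frac=0.2, test_frac=0.2, seed=0):
    """Train set for fitting, validation set for choosing model
    complexity, test set used once at the very end."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    n_test = int(test_frac * len(X))
    n_val = int(val_frac * len(X))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return X[train], y[train], X[val], y[val], X[test], y[test]
```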

Create test set

Linear regression summary

  • linear regression defines a loss function (typically mean squared error) between actual and predicted observations
  • training can be done via gradient descent: each epoch corresponds to a single parameter update
  • (gradient descent is also used to train many other methods, like neural nets)
  • regressions with more complex functional forms can fit more complex data
  • but risks poor generalisation

Questions?

Logistic regression

  • confusingly, it is a classifier, not a regression (in the ML sense)
  • it is used to create models to classify binary data
  • very common across ML and statistics

Example data

Cancer and coins

Denote:

  • \(y_i=0\) indicates cancer-free
  • \(y_i=1\) indicates presence of lung cancer

Two outcomes: so it is like flipping a coin!

How to model these data?

Coin flipping distribution

Denote:

  • \(\text{Pr}(y_i=0) = 1-\theta\)
  • \(\text{Pr}(y_i=1) = \theta\)

where \(0\leq \theta \leq 1\). We can represent both outcomes in a single probability distribution:

\[\text{Pr}(y_i=y) = \theta^y(1-\theta)^{1-y} \]

which is known as the Bernoulli distribution:

\[\begin{equation} y_i \sim \text{Bernoulli}(\theta) \end{equation}\]
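A quick simulation sketch (the bias value is an illustrative assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3                              # assumed bias
y = rng.binomial(n=1, p=theta, size=10)  # Bernoulli = Binomial with n=1
print(y)                                 # ten 0/1 draws, roughly 30% ones
```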

How to estimate coin bias?

Suppose you flip a coin twice: \(y_1=1\), \(y_2=0\). Assuming independence:

\[\begin{equation} \text{Pr}(y_1=1,y_2=0|\theta) = \theta \times (1-\theta). \end{equation}\]

We call \(L(\theta)=\text{Pr}(y_1=1,y_2=0|\theta)\) a likelihood.

Maximum likelihood estimation

Want to choose \(\theta\) to maximise the probability of obtaining those results

Derivatives

Find the maximum by differentiating \(L(\theta) = \theta(1-\theta)\):

\[\begin{equation} \frac{d L}{d\theta} = 1 - 2 \theta = 0 \end{equation}\]

Rearranging, we obtain:

\[\begin{equation} \theta = \frac{1}{2} \end{equation}\]
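The same calculation generalises: for \(n\) independent flips with \(h\) heads, the likelihood and its maximiser are:

\[\begin{equation} L(\theta) = \theta^{h}(1-\theta)^{n-h} \implies \hat{\theta} = \frac{h}{n} \end{equation}\]

i.e. the maximum likelihood estimate is simply the observed proportion of heads.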

How to model these data?

Biased coins

How to model bias?

In logistic regression, we use logistic function:

\[\begin{equation} \theta = \frac{1}{1 + \exp (-x)} \end{equation}\]
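A one-line sketch shows how the logistic function squashes any real input into \((0,1)\):

```python
import numpy as np

def logistic(x):
    """Maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

print(logistic(np.array([-5.0, 0.0, 5.0])))  # approx. [0.007, 0.5, 0.993]
```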

Logistic regression

We want to estimate how sensitive the presence or absence of lung cancer is to tar exposure, so we model the probability:

\[\begin{equation} \theta_i = f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_i))} \end{equation}\]

which is known as logistic regression, and we assume:

\[\begin{equation} y_i \sim \text{Bernoulli}(\theta_i) \end{equation}\]

How to estimate \(\beta_0\) and \(\beta_1\)?

Data for one individual \((x_i,y_i)\) has probability:

\[\text{Pr}(y_i=y) = f_\beta(x_i)^y(1-f_\beta(x_i))^{1-y} \]

Likelihood for two individuals’ data

Suppose we have data \((x_1,y_1=1)\) and \((x_2,y_2=0)\).

Assume data are:

  • independent
  • identically distributed

Then the overall probability is just the product of the individual probabilities:

\[\begin{align} L &= f_\beta(x_1)^{y_1} (1-f_\beta(x_1))^{1-y_1} f_\beta(x_2)^{y_2}(1-f_\beta(x_2))^{1-y_2}\\ &= f_\beta(x_1) (1-f_\beta(x_2)) \end{align}\]

Likelihood for larger datasets

Same logic applies under i.i.d. assumption:

\[\begin{equation} L = \prod_{i=1}^{K} f_\beta(x_i)^{y_i} (1 - f_\beta(x_i))^{1 - y_i} \end{equation}\]
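In practice we work with the log-likelihood, which turns the product into a sum and is numerically better behaved:

\[\begin{equation} \log L = \sum_{i=1}^{K} \left[ y_i \log f_\beta(x_i) + (1 - y_i) \log (1 - f_\beta(x_i)) \right] \end{equation}\]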

Maximum likelihood estimation

Unlike the simple coin-flipping case, there is no analytic solution for the maximum likelihood estimates. Instead, do gradient descent on the negative log-likelihood \(\ell = -\log L\):

\[\begin{align} \beta_0 &= \beta_0 - \eta \frac{\partial \ell}{\partial \beta_0}\\ \beta_1 &= \beta_1 - \eta \frac{\partial \ell}{\partial \beta_1} \end{align}\]

where \(\eta>0\) is the learning rate.
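A minimal sketch of this procedure, assuming 1-D data arrays `x` and `y` and the standard gradient of the negative log-likelihood (the learning rate and epoch count are illustrative; dividing the gradients by \(K\) would make the step size less dataset-dependent):

```python
import numpy as np

def fit_logistic(x, y, eta=0.01, epochs=50_000):
    """Gradient descent on the negative log-likelihood."""
    b0, b1 = 0.0, 0.0
    for _ in range(epochs):
        f = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))  # predicted probabilities
        grad_b0 = np.sum(f - y)        # d(-log L)/d(beta_0)
        grad_b1 = np.sum(x * (f - y))  # d(-log L)/d(beta_1)
        b0 -= eta * grad_b0
        b1 -= eta * grad_b1
    return b0, b1
```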

How to interpret model estimates?

Suppose we estimate that \(\beta_0=-1\) and \(\beta_1=2\). What do these mean?

\[\begin{equation} \theta_i = \frac{1}{1 + \exp (-(-1 + 2 x_i))} \end{equation}\]

so the impact of incremental changes in \(x_i\) on the probability of lung cancer is nonlinear.
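Plugging in a few values makes this concrete; a quick check with these estimates:

```python
import numpy as np

x = np.array([0.0, 1.0, 2.0])
theta = 1.0 / (1.0 + np.exp(-(-1.0 + 2.0 * x)))
print(theta.round(3))  # approx. [0.269, 0.731, 0.953]
```

The same one-unit step in \(x_i\) raises the probability by about 0.46 and then by only about 0.22.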

Nonlinear impact

Can we find an interpretation?

\[\begin{equation} \theta_i = \frac{1}{1 + \exp (-(-1 + 2 x_i))} \end{equation}\]

meaning

\[\begin{equation} 1-\theta_i = \frac{\exp (-(-1 + 2 x_i))}{1 + \exp (-(-1 + 2 x_i))} \end{equation}\]

Calculate odds

The ratio of the probability of lung cancer to the probability of being cancer-free is called the odds:

\[\begin{align} \frac{\theta_i}{1-\theta_i} &=\exp (-1 + 2 x_i) \end{align}\]

so here \(\exp 2 \approx 7.4\) gives the multiplicative change in the odds for a one-unit change in \(x_i\). Because of this, \(\exp \beta_1\) is known as the odds ratio for that variable.

Odds ratios

  • if \(\beta_1 > 0\), the odds ratio is \(>1\), indicating that increases in the variable increase the probability of the \(y_i=1\) event occurring.
  • vice versa for \(\beta_1 < 0\).

Log odds interpretation

Taking log of both sides:

\[\begin{equation} \log \frac{\theta_i}{1-\theta_i} = -1 + 2 x_i \end{equation}\]

so we see that \(\beta_1=2\) gives the change in the log-odds for a one-unit change in \(x_i\).

Multivariate logistic regression

It is straightforward to extend the model to incorporate multiple regressors:

\[\begin{equation} f_\beta(x_i) := \frac{1}{1 + \exp (-(\beta_0 + \beta_1 x_{1,i} + \dots + \beta_p x_{p,i}))} \end{equation}\]
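In practice this is usually fitted with a library; a sketch using scikit-learn (from the reading list below) on synthetic stand-in data, since the real dataset is not shown here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in data: K = 200 individuals, p = 3 regressors (assumed)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
true_beta = np.array([2.0, -1.0, 0.5])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ true_beta)))

model = LogisticRegression().fit(X, y)
print(np.exp(model.coef_[0]))  # one odds ratio per regressor
```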

Logistic regression summary

  • logistic regression models are binary classifiers (in ML speak)
  • assumes Bernoulli distribution for outputs
  • logistic function used to relate changes in inputs to outputs
  • multivariate logistic regression is a commonly used tool

How to learn more?

All available on SOLO:

  • “The Hundred-Page Machine Learning Book”, Burkov
  • “Hands-On Machine Learning with Scikit-Learn & TensorFlow”, Géron

Coursera:

  • Data Science: Statistics and Machine Learning Specialization, Johns Hopkins